Chapter 7 Chinese Text Processing
In this chapter, we will discuss one of the most important issues in Chinese language/text processing, i.e., word segmentation. When we discussed tokenization in @ref{tokenization}, word tokenization in English was easy because word boundaries in English are clearly delimited by whitespace. Chinese, however, has no whitespace between characters, which makes word tokenization a serious problem.
This chapter is devoted to Chinese text processing. We will look at the issues of word tokenization and introduce jiebaR, the most widely used library for Chinese word segmentation. We will also include several case studies on Chinese text processing.
library(tidyverse)
library(tidytext)
library(quanteda)
library(stringr)
library(jiebaR)
library(readtext)
7.1 Chinese Word Segmenter jiebaR
If you have not yet installed the library jiebaR, please install it first:
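The standard installation command (from CRAN) is:

```r
# install jiebaR from CRAN (jiebaRD, which ships the default
# dictionaries, is installed automatically as a dependency)
install.packages("jiebaR")
```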
Now let us take a look at a quick example.
text <- "綠黨桃園市議員王浩宇爆料,指民眾黨不分區被提名人蔡壁如、黃瀞瑩,在昨(6)日才請辭是為領年終獎金。台灣民眾黨主席、台北市長柯文哲7日受訪時則說,都是按流程走,不要把人家想得這麼壞。"
seg1 <- worker()
segment(text, jiebar = seg1)
## [1] "綠黨" "桃園市" "議員" "王浩宇" "爆料" "指民眾"
## [7] "黨" "不" "分區" "被" "提名" "人"
## [13] "蔡壁如" "黃" "瀞" "瑩" "在昨" "6"
## [19] "日" "才" "請辭" "是" "為領" "年終獎金"
## [25] "台灣民眾" "黨" "主席" "台北" "市長" "柯文"
## [31] "哲" "7" "日" "受訪" "時則" "說"
## [37] "都" "是" "按" "流程" "走" "不要"
## [43] "把" "人家" "想得" "這麼" "壞"
To segment a text, you first initialize a segmenter (here, seg1) with worker(), and then pass the text together with the segmenter to segment().
There are many different parameters you can specify when you initialize the segmenter with worker(). You can find more details in the documentation (?worker). Some of the important arguments include:
- user = ...: the path to a user-defined dictionary
- stop_word = ...: the path to a stopword list
- symbol = FALSE: whether to return symbols (the default is FALSE)
- bylines = FALSE: whether to return the results line by line, i.e., as a list with one vector of words per input line
From the example above, it is clear that some of the words are not correctly identified by the current segmenter: 民眾黨, 不分區, 黃瀞瑩, 柯文哲. It is always recommended to include a user-defined dictionary when doing word segmentation, because different corpora may have their own unique vocabulary.
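The improved segmentation shown below was produced with a user-defined dictionary. Assuming it is the demo dictionary demo_data/dict-ch-user-demo.txt that also appears later in this chapter (the exact chunk is not shown here), the call would look like this:

```r
library(jiebaR)

# the same sentence as above
text <- "綠黨桃園市議員王浩宇爆料,指民眾黨不分區被提名人蔡壁如、黃瀞瑩,在昨(6)日才請辭是為領年終獎金。台灣民眾黨主席、台北市長柯文哲7日受訪時則說,都是按流程走,不要把人家想得這麼壞。"

# initialize a segmenter with a user-defined dictionary
seg2 <- worker(user = "demo_data/dict-ch-user-demo.txt")
segment(text, jiebar = seg2)
```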
## [1] "綠黨" "桃園市" "議員" "王浩宇" "爆料" "指"
## [7] "民眾黨" "不分區" "被" "提名" "人" "蔡壁如"
## [13] "黃瀞瑩" "在昨" "6" "日" "才" "請辭"
## [19] "是" "為領" "年終獎金" "台灣" "民眾黨" "主席"
## [25] "台北" "市長" "柯文哲" "7" "日" "受訪"
## [31] "時則" "說" "都" "是" "按" "流程"
## [37] "走" "不要" "把" "人家" "想得" "這麼"
## [43] "壞"
The format of the user-defined dictionary is one word per line, and the default encoding of the dictionary is UTF-8. Please note that on Windows, the default encoding of a .txt file created by Notepad may not be UTF-8.
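For instance, a minimal UTF-8 dictionary file covering the names the default segmenter missed above could look like this (one word per line):

```
民眾黨
不分區
蔡壁如
黃瀞瑩
柯文哲
```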
Creating a user-defined dictionary may take a lot of time. You may consult 搜狗詞庫 (the Sogou lexicon collection), which includes many domain-specific dictionaries created by others. However, it should be noted that these dictionaries are in the .scel format. You need to convert the .scel files to .txt before you use them in jiebaR. To do the conversion automatically, please consult the library cidian.
When you initialize the segmenter, you can also specify a stopword list, i.e., a list of words you do not want to include in later analyses.
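The output below reflects such a stopword list. Assuming the demo file demo_data/stopwords-ch-demo.txt used later in this chapter, the segmenter would be initialized as follows:

```r
library(jiebaR)

text <- "綠黨桃園市議員王浩宇爆料,指民眾黨不分區被提名人蔡壁如、黃瀞瑩,在昨(6)日才請辭是為領年終獎金。台灣民眾黨主席、台北市長柯文哲7日受訪時則說,都是按流程走,不要把人家想得這麼壞。"

# initialize a segmenter with a stopword list;
# words on the list (e.g., 日, 是, 都) are removed from the output
seg3 <- worker(stop_word = "demo_data/stopwords-ch-demo.txt")
segment(text, jiebar = seg3)
```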
## [1] "綠黨" "桃園市" "議員" "王浩宇" "爆料" "指民眾"
## [7] "黨" "不" "分區" "被" "提名" "人"
## [13] "蔡壁如" "黃" "瀞" "瑩" "在昨" "6"
## [19] "才" "請辭" "為領" "年終獎金" "台灣民眾" "黨"
## [25] "主席" "台北" "市長" "柯文" "哲" "7"
## [31] "受訪" "時則" "說" "按" "流程" "走"
## [37] "不要" "把" "人家" "想得" "這麼" "壞"
So far we have not seen the part-of-speech tags provided by the word segmenter. If you need the words to be tagged, you have to specify this when you initialize the worker():
seg4 <- worker(type = "tag", user = "demo_data/dict-ch-user-demo.txt", stop_word = "demo_data/stopwords-ch-demo.txt")
segment(text, seg4)
## n ns n x n n x
## "綠黨" "桃園市" "議員" "王浩宇" "爆料" "指" "民眾黨"
## x p v n x x x
## "不分區" "被" "提名" "人" "蔡壁如" "黃瀞瑩" "在昨"
## x d v x n x x
## "6" "才" "請辭" "為領" "年終獎金" "台灣" "民眾黨"
## n ns n x x v x
## "主席" "台北" "市長" "柯文哲" "7" "受訪" "時則"
## zg p n v df p n
## "說" "按" "流程" "走" "不要" "把" "人家"
## x r a
## "想得" "這麼" "壞"
The following table lists the annotations of the POS tagsets used in jiebaR:
You can check the dictionaries being used in your current environment:
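The path and file listing shown below can be obtained with jiebaR's show_dictpath() helper together with base R's dir():

```r
library(jiebaR)

show_dictpath()       # path to the default dictionaries shipped with jiebaRD
dir(show_dictpath())  # the dictionary files in that directory
```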
## [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/jiebaRD/dict"
## [1] "backup.rda" "hmm_model.utf8" "hmm_model.zip" "idf.utf8"
## [5] "idf.zip" "jieba.dict.utf8" "jieba.dict.zip" "model.rda"
## [9] "README.md" "stop_words.utf8" "user.dict.utf8"
scan(file = "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/jiebaRD/dict/stop_words.utf8",
     what = character(), nlines = 50, sep = '\n',
     encoding = 'utf-8', fileEncoding = 'utf-8')
## [1] "\"" "." "。" "," "、" "!" "?" ":" ";" "`" "﹑" "•"
## [13] """ "^" "…" "‘" "’" "“" "”" "〝" "〞" "~" "\\" "∕"
## [25] "|" "¦" "‖" "— " "(" ")" "〈" "〉" "﹞" "﹝" "「" "」"
## [37] "‹" "›" "〖" "〗" "】" "【" "»" "«" "』" "『" "〕" "〔"
## [49] "》" "《"
When we use segment() as the tokenization method in unnest_tokens(), it is very important to specify bylines = TRUE in worker(). This setting ensures that segment() takes a vector of texts as input and returns a list of word vectors as output, which is exactly the interface unnest_tokens() expects of a custom tokenizer.
NB: When bylines = FALSE, segment() returns a vector.
seg_byline_1 <- worker(bylines = T)
seg_byline_0 <- worker(bylines = F)
(text_tag_1 <- segment(text, seg_byline_1))
## [[1]]
## [1] "綠黨" "桃園市" "議員" "王浩宇" "爆料" "指民眾"
## [7] "黨" "不" "分區" "被" "提名" "人"
## [13] "蔡壁如" "黃" "瀞" "瑩" "在昨" "6"
## [19] "日" "才" "請辭" "是" "為領" "年終獎金"
## [25] "台灣民眾" "黨" "主席" "台北" "市長" "柯文"
## [31] "哲" "7" "日" "受訪" "時則" "說"
## [37] "都" "是" "按" "流程" "走" "不要"
## [43] "把" "人家" "想得" "這麼" "壞"
(text_tag_0 <- segment(text, seg_byline_0))
## [1] "綠黨" "桃園市" "議員" "王浩宇" "爆料" "指民眾"
## [7] "黨" "不" "分區" "被" "提名" "人"
## [13] "蔡壁如" "黃" "瀞" "瑩" "在昨" "6"
## [19] "日" "才" "請辭" "是" "為領" "年終獎金"
## [25] "台灣民眾" "黨" "主席" "台北" "市長" "柯文"
## [31] "哲" "7" "日" "受訪" "時則" "說"
## [37] "都" "是" "按" "流程" "走" "不要"
## [43] "把" "人家" "想得" "這麼" "壞"
class(text_tag_1)
## [1] "list"
class(text_tag_0)
## [1] "character"
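As a sketch of the tidytext workflow described above (the column names here are illustrative, not from a chunk in this chapter), a bylines = TRUE segmenter can be plugged into unnest_tokens() as a custom tokenizer:

```r
library(dplyr)
library(tidytext)
library(jiebaR)

# the tokenizer must return a list of word vectors,
# hence bylines = TRUE
seg <- worker(bylines = TRUE)

tibble(doc_id = 1, text = "台灣民眾黨主席、台北市長柯文哲7日受訪") %>%
  unnest_tokens(word, text, token = function(x) segment(x, jiebar = seg))
```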
7.2 Case Study 1: Word Frequency and Wordcloud
# loading the corpus
# NB: this may take some time
apple_df <- readtext("demo_data/applenews10000.tar.gz") %>%
as_tibble() %>%
filter(text != "") %>%
mutate(doc_id = row_number())
apple_df
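With the corpus loaded, a word-frequency count of the kind this case study aims at could be sketched as follows (an illustration reusing the demo user dictionary from earlier in this chapter; not necessarily the exact steps taken below):

```r
library(dplyr)
library(tidytext)
library(jiebaR)

# bylines = TRUE so segment() can serve as a custom tokenizer
seg <- worker(bylines = TRUE, user = "demo_data/dict-ch-user-demo.txt")

apple_word_freq <- apple_df %>%
  unnest_tokens(word, text, token = function(x) segment(x, jiebar = seg)) %>%
  count(word, sort = TRUE)
```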